Prosper Loan Data Analysis by YALA Nawal

Introduction

I chose Prosper Loan dataset because it describes a new area (financial area) for me and therefore I will learn new things other that I will learn by dealing the red/white wine datasets (datasets seen before in term 1). Prosper marketplace is one of the first Peer-to-peer lending sites. It bring borrowers together on the same website with their loans. It is basically removing the middleman, which is the bank, and creates a mutually beneficial financial arrangement. Prosperdataset contains a large number of variables. The best way to understand them is by describing how a loan process works. Applying for a loan looks something like this:

  • Create your loan listing - you provide basic information, then Prosper will obtain your credit score (ProsperScore) and determine your BorrowerRate and terms.
  • Based on your credit score (ProsperScore) and other information Prosper will obtain, you will be assigned a risk score (ProsperRating), from A to HR.
  • You then create a loan listing which is your request for a loan. You will add a description of your loan purpose and financial situation. It will appear on the platform to be reviewed by investors.
  • Once the loan listing is fully funded and your information has passed Prosper’s verification process (EmploymentStatus, * EmploymentStatusDuration, IsBorrowerHomeowner ect), you will receive your loan.
  • The listing will stay active for 14 days, or until the loan funds.
  • Loan funds are deposited directly into your bank account within days.
  • You begin making your monthly payments.

My analysis is limited to the following variables:

  • ListingCreationDate: The date the listing was created.
  • Term: The length of the loan expressed in months.
  • LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
  • ClosedDate: Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses.
  • BorrowerRate: The Borrower’s interest rate for this loan.
  • ProsperRating..Alpha.: The Prosper Rating assigned at the time the listing was created between AA HR. Applicable for loans originated after July 2009.
  • ProsperScore: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. *istingCategory : he category of the listing that the borrower selected when posting their listing
  • Occupation: The Occupation selected by the Borrower at the time they created the listing.
  • EmploymentStatus: The employment status of the borrower at the time they posted the listing.
  • EmploymentStatusDuration: The length in months of the employment status at the time the listing was created.
  • IsBorrowerHomeowner: A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
  • DebtToIncomeRatio: The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
  • IncomeVerifiable: The borrower indicated they have the required documentation to support their income.
  • LoanOriginalAmount: The origination amount of the loan.
  • LoanOriginationDate: The date the loan was originated.
  • LoanOriginationQuarter: The quarter in which the loan was originated.
  • Investors: The number of investors that funded the loan.
  • MonthlyLoanPayment: The scheduled monthly loan payment.

Univariate Plots Section

Let’s get an overview on the dataset and its structure

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Prosper loan dataset consists of 113937 observations and 81 variables.

## [1] 113937     81

Q: The first question can be asked is what kinds of loans does Prosper make?

Listing Categories are:

##  [1] "0 - Not Available"       "1 - Debt Consolidation" 
##  [3] "2 - Home Improvement"    "3 - Business"           
##  [5] "4 - Personal Loan"       "5 - Student Use"        
##  [7] "6 - Auto, 7- Other"      "8 - Baby&Adoption"      
##  [9] "9 - Boat"                "10 - Cosmetic Procedure"
## [11] "11 - Engagement Ring"    "12 - Green Loans"       
## [13] "13 - Household Expenses" "14 - Large Purchases"   
## [15] "15 - Medical/Dental"     "16 - Motorcycle"        
## [17] "17 - RV"                 "18 - Taxes"             
## [19] "19 - Vacation"           "20 - Wedding Loans"

As we see, the taller bar represents Debt Consolidation category. The most popular purposes for Prosper loans is to pay off credit card debt. “Not available” and “Other”" purposes came after.

Q: Term loans

As we can see, all loans have terms of one, three or five years.

As we can see, number of loans increases across years. We note that at 2009, the SEC ordered Prosper.com to stop making new loans, because federal law prohibits the sale or offer of securities without the company registering with the regulator. This is why we see a decrease in number of loans in this year. We see too that the number of loans in the first and 4th quater is greater than its number in the middle of year.

Q: loan Status

Completed loan status: is a loan that is fully paid off according to the schedule or prepaid. Current loan status: is a loan that is making payments on time and as agreed. ChargedOff loan status: is a loan with 5 missed payments (i.e. 150 days). Defaulted loan status: is a loan thah has a default reason, which is one of the followings:

  • 1 = Delinquency
  • 2 = Bankruptcy
  • 3 = Deceased
  • 4 = Repurchased
  • 5 = PaidInFull
  • 6 = SettledInFull
  • 7 = Sold

Q: Prosper rating and Prosper score

## Warning: Removed 29084 rows containing non-finite values (stat_count).

We see listing Prosper ratings from the best(A) to the worst(HR). The taller bar represents listings without any Prosper Rating (NA). To discover/exploring his story, I facet the plot by ListingCreationYear variable. Prosper score is a custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score.

As we can see in the plot, the bar bresenting no Prosper Rating exists only in 2005,2006,2008 and 2009. After 2009, the N/A bar disappeared. What I think about is the Prosper.com adds Prosper Rating feature after 2009.

Q: Distribution of loan amounts

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Amount loan is limited between 1000 and 35000 dollars. It has a right skewed distribution. We see some descending peaks at 4000, 15000, 20000 and 25000 dollars.

Q: What about borrower rate ?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

BorrowerRate is the Borrower’s interest rate for a given loan. As can we see, it is nearly normally distributed.

Q: How many days take a loan to be be funded?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00    9.00   12.34   13.00 1095.00
## Warning: Removed 1828 rows containing non-finite values (stat_count).
## Warning: Removed 2 rows containing missing values (geom_bar).

As we can see, a loan takes by median 9 days to be funded. The hostogram plot is limted at 100 days. There are many outliers in the NumberOfDaysLoanBeFunded data.

Q: Borrower Employment Status

As we can see, most borrowers are employed.

Q: Borrowers Occupation

## 
##                                                        Accountant/CPA 
##                               3588                               3233 
##           Administrative Assistant                            Analyst 
##                               3688                               3602 
##                          Architect                           Attorney 
##                                213                               1046 
##                          Biologist                         Bus Driver 
##                                125                                316 
##                         Car Dealer                            Chemist 
##                                180                                145 
##                      Civil Service                             Clergy 
##                               1457                                196 
##                           Clerical                Computer Programmer 
##                               3164                               4478 
##                       Construction                            Dentist 
##                               1790                                 68 
##                             Doctor                Engineer - Chemical 
##                                494                                225 
##              Engineer - Electrical              Engineer - Mechanical 
##                               1125                               1406 
##                          Executive                            Fireman 
##                               4311                                422 
##                   Flight Attendant                       Food Service 
##                                123                               1123 
##            Food Service Management                          Homemaker 
##                               1239                                120 
##                           Investor                              Judge 
##                                214                                 22 
##                            Laborer                        Landscaping 
##                               1595                                236 
##                 Medical Technician                  Military Enlisted 
##                               1117                               1272 
##                   Military Officer                       Nurse's Aide 
##                                346                                491 
##                        Nurse (LPN)                         Nurse (RN) 
##                                492                               2489 
##                              Other                         Pharmacist 
##                              28617                                257 
##         Pilot - Private/Commercial  Police Officer/Correction Officer 
##                                199                               1578 
##                     Postal Service                          Principal 
##                                627                                312 
##                       Professional                          Professor 
##                              13628                                557 
##                       Psychologist                            Realtor 
##                                145                                543 
##                          Religious                  Retail Management 
##                                124                               2602 
##                 Sales - Commission                     Sales - Retail 
##                               3446                               2797 
##                          Scientist                      Skilled Labor 
##                                372                               2746 
##                      Social Worker         Student - College Freshman 
##                                741                                 41 
## Student - College Graduate Student           Student - College Junior 
##                                245                                112 
##           Student - College Senior        Student - College Sophomore 
##                                188                                 69 
##        Student - Community College         Student - Technical School 
##                                 28                                 16 
##                            Teacher                     Teacher's Aide 
##                               3759                                276 
##              Tradesman - Carpenter            Tradesman - Electrician 
##                                120                                477 
##               Tradesman - Mechanic                Tradesman - Plumber 
##                                951                                102 
##                       Truck Driver                    Waiter/Waitress 
##                               1675                                436
## [1] 68

Borrowers mentioned 68 occupations. Professional (13628) and “other”(28617) are the most borrower occupations.

Q: Borrower Employment status duration

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625
## Warning: Removed 7709 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Meadian of Borrowers EmploymentStatusDuration is abour 67 months(5.5 years). Mean is at 8 years.

Q:

proportion of Borrower Homeowner is very close to that of not Borrower Homeowner.

Q: Let’s know about the Debt To Income Ratio of the borrower.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Most borrowers have DebtToIncomeRatio in (0,1] interval.

Q: Investors are the persons that funded a loan. Each loan has its number of investors

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00

As we can see, many investors can funded one loan.

Q:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In this graph, I plot the distribution of EstimatedReturn and EstimatedLoss. The variable values are assigned to the listing (loan before be funded) at the time it was created. They have a normal distrubution.

Univariate Analysis

What is the structure of your dataset?

There are 113937 loans (or listings, the listing become loan when is funded) with 81 variables. There are many factor variables. In this analysis, I intersted to these factor variables: ProsperScore, LoanStatus, ProsperRating..Alpha., LoanOriginationAmount Occupation, EmploymentStatus and IsBorrowerHomeowner.

  • ProsperScore: 1 2 3 4 5 6 7 8 9 10 11.
  • ProsperRating..Alpha.:A AA B C D E HR.
  • IsBorrowerHomeowner: False or True.
  • EmploymentStatus: Employed, Full-time, Not available, Not employed, Other, Part-time Retired, Self-employed.
  • LoanStatus: Completed,Current, FinalPaymentInProgress, Cancelled, Defaulted, Past Due, Chargedoff.
Other observations:
  • Most loans are Current status
  • Most loan purpose is Debt Consolidation.
  • Loan amount is between 1000 and 35000, the mean is at 8337$
  • Loan terms are 1,3 or 5 years.
  • Borrower Rate is between 0 and 0.4975 with a mean of 0.1928.
  • People tend to borrow in the first or last quarter of the year.
  • Most borrowers are employed
  • A loan can be funded by more than one investors.
  • Most of boroowers has a debt to income ratio less or equal 1.
  • A loan takes by median 9 days to be funded.

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are ProsperRating..Alpha., ProsperScore, LoanStatus, DebtToIncomeRatio. I’d like to determine which features are best for predicting if a loan will pay-off its loan on time.

What other features in the dataset do you think will help support your

investigation into your feature(s) of interest?

Did you create any new variables from existing variables in the dataset?

  • I correct the Prosper Rating variable order as AA, A, B, C, D, E, HR instead of this order A AA B C D E HR.
  • I create a LoanOriginationYear, LoanOriginationMonth columns to get the Year of LoanOriginationDate. The LoanOriginationQuarter column contains the year and the quarter of the origination loan, I only preserved the quater part of this value.
  • I create a bucket variable for DebtToIncomeRatio to better present the variable.
  • I create a NumberOfDaysLoanBeFunded variable to calculate number of days a loan takes to be funded.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I just ordered the Loan Status factor variable, to make the bar of “Due past >120 days” loan status coming after the “Due past(90-120 days)” bar in the plot.

Bivariate Plots Section

Q: The introduction of this report stated that based on the ProsperScore and other information Prosper will obtain, a borrower will be assigned a ProsperRating (risk score), from A to HR. Now I w’d like to know how is the relationsip between these two variables: ProsperScore and ProsperRating using scatterplot.

As we can see, high ProsperScore leads to the best ProsperRating and vice versa. 1 ProsperScore is a the worst score lead to a worst ProperRating (E, HR). We see very few points in the ProperRating best region (A and AA). This is because ProsperRating is built depending on other information than the ProsperScore. We note that the plot suffers from some overplotting without jitter.

Q: What other information according to which the ProsperRating is calculated?

We see that despite the borrower is not homeowner can get A, AA and B prosper rating. Therefore the homeowner caracteristic is not used to determine prosper rating.

Q: Is DebtToIncomeRatio used to determine ProsperRating and BorrowerRate?

It is clear that it’s not used to determine Prosper Rating.

## Warning: Removed 8554 rows containing missing values (geom_point).

Nor to determine BorrowerRate

Q: Now, one can ask: Is the 5 years loan term available for all borrowers? In other word, is there a relationship between terms and Prosper Rating?

Loan is assigned by a Prosper rating at the time the listing was created. It takes the sympols: A(best) AA B C D E HR(worst). As we can see, term loan depend on the Prosper rating. Five-year term loan is available at all Prosper Rating levels, but only a three-year term is available on an HR Prosper rating loan.

Q: Let’s know if the ProsperRating affect the BorrowerRate.

loans with worst risk score (Prosper rating) has Higer BorrowerRate; a strong negative correlation. Ok, but why? Can be a way to recover the loan faster? We note that the plot suffers from some overplotting without jitter.

Q: Is there a relationship between Prosper rating and EstimatedLoss

## Warning: Removed 29084 rows containing missing values (geom_point).

A strong negative correlation between EstimatedLoss and ProsperRating. Best prosperRating borrowers have small estimated loss.

Q: Does the loan Term influence the BorrowerRate, let’s see.

What I thought that where the years Term increase the BorrowerRate increase too, like a traditional bank does. This is not what is shown in the box plot. It’s true that BorrowerRate median increase according to number of years in the Term, howeever three years Term has the larger box and range and has some ouliers. As we saw in univariate section, the number of 3 years term loans is the largest. what I think this term is the goal of this business.

Q: Let’s know about BorrowerRate and LoanOriginalAmount.

BorrowerRate range and density decreases as long as LoanAmount increases. I can’t do strong conclusion for these two variables with this plot as it is. I do think that other factor or factors affects the BorrrowerRate. I will explore that in the multivariate section. Otherwise, we can see some points far way to the global density; outliers with very small or 0 BorrowerRate. Let’s know who are :)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -1415.5   287.0   530.0   599.5   964.1  1590.1   58848

I created a new variable NumberOfDaysPayOff to calculate number of days taken by the borrower to pay-off its loan. I used this variable to observe any anomaly in the loans with 0 BorrowerRate.

##   LoanStatus ListingCategory..numeric. NumberOfDaysPayOff
## 1  Completed                         0           32.68085
## 2  Defaulted                         0         1243.91489
## 3  Completed                         3         1118.29787
## 4  Completed                         0         1119.31915
## 5  Completed                         0         1119.31915
## 6  Completed                         1          514.72340
## 7  Completed                         0          214.46809
## 8  Completed                         7          823.14894
##   ProsperRating..Alpha. Term LoanOriginationYear LoanOriginalAmount
## 1                         36                2006               1000
## 2                         36                2007               1900
## 3                         36                2008               1000
## 4                         36                2007               3000
## 5                         36                2006               5000
## 6                         36                2008              25000
## 7                         36                2006               1000
## 8                         36                2008               3000

I tried to write a code to discover from where these outliers came. I selected some variables Which I suspect they caused these data. I can’t conclude ant thing! All 0 borrower rate loan are completed, i.e, the borrowers pay-off thiers loans.

Q: Let’s know about loan amount and Terms.

As was expected, when the loan amount increases the pay-off term also increases and vice varsa. This is true for one year term loan. For the two other terms, we can see that they have praticely the same LoanAmount range and a different mean and median (look at boxplot). What explain the same LoanAmount, a different term? No doubt a third variable.

Q: Does the Debt To Income Ratio influence the BorrowerRate?

There is no correlation between the two variables. It seems that the BorrowerRate is determined in isolation of the borrower financial situation.

Q:

In the above boxplot, we try to looking at how ProsperRating and LoanAmount relate with each other. Loan amount is the amount funded by investors. We note that it may be that the borrower is asking for an amount, but he will not have the full amount requested. As it is shown, the worst score risk loans E and H recive the lesser amounts, this is explained by the small range.

As we see in previous plot, number of loan with debt consolidation purpose is the larget among all other purpose. Likwise, the median of loan amount with this purpose is the biggest too. After come the ‘other’ purpose (8) then the Wedding Loans (20).

Q: Is the DebtToIncomeRatio relate to the Loan amount

## Warning: Removed 51377 rows containing missing values (geom_point).

They don’t relate to each other. I limited the x axis from 0 to 0.25 to explore better the data.

Q:

## Warning: Removed 29084 rows containing missing values (geom_point).

Estimated loss is the estimated principal loss on charge-offs. It is assigned to the listing at the time it was created ( before loan be funded). We can see that as long as the estimated loss increases the loan amount and density (number of loan) decrease.

Q: Does a lower risk score (ProsperRating) can lead more to a Chargedoff loan than the higher risk score?

Pink color in the bar chart represents the proportion of a chargedoff loan according to each score risk(Prosperrating). Whenever we move towards the worst ProsperRating, i.e, towards the higest risk score, the chargedoff proportion increases. This was expected. The scatterplot shows this clearly with borrowers density.

Q: In the plot below, I gathered all loan status that we can say that there is a problem with thiers borrowers in one group. This group does contain Completed, Cancelled, Current and FinalPaymentInProgress status. And I ploted the char bar of prosper rating by the status in the new group.

Despite D prosper rating loans are not the most numerous loan in the data (the C prosper rating loan is), we can observe that D prosper rating delinquent borrowers are more noumerous than any other borrowers even than to E and HR borrowers. This may be due to the fact that thiers prosper rating come in the middle between the good and worst prosper rating and this confuses investors.

Q:

This plot is when I add Term variable. Whtever the term, D prosper rating delinquent borrower exist satically. What I observed is the number of E prosper rating delinquent borrowers decrease when we pass to higer Term. The opposite occurs with A prosper rating delinquent borrowers.

Q:

As we can see, EmploymentStatus doesn’t influence the pay-off/not pay-off of the loan. Where we see that the Chargedoff and Defaulted loan can be the loan status of any borrower emplyement status.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • There is a correlation between ProsperScore and ProsperRating(risk score)
  • Borrower with the worst ProsperRating (HR) can’t take a loan with 5 years term.
  • Borrowers with best ProsperRating have a lower BorrowerRate and vice versa.
  • 3 years Term has the largest BorrowerRate range.
  • There is no linear relationship between BorrowerRate and Term like in traditional bank.
  • There is a correlation between Terms and Loan amount.
  • It seems that the BorrowerRate is determined in isolation of the borrower financial situation.
  • Worst borrower prosper rating recieves a lesser loan amount.
  • Worst ProsperRating leads more to a chargedoff loan.
  • As long as Estimated loss increases the loan amount and density (number of loan) decrease.
  • There is a correlation between EstimatedLoss and ProsperRating.
  • D prosper ratingdelinquent borrowers are more noumerous than any other borrowers even than to E and HR borrowers.

What was the strongest relationship you found?

  • A Strong positive relationship between: ProsperScore and ProsperRating; BorrowerRate and EstimatedLoss.

  • A positive relationship between: ProsperRating and LoanOriginalAmount; Term and LoanOriginalAmount.

  • A strong negative relationship between BorrowerRate and ProsperRating; EstimatedLoss and ProsperRating.

  • A negative relationship between EstimatedLoss and LoanOriginalAmount; BorrowerRate and LoanOriginalAmount.

Multivariate Plots Section

Q: How BorrowerRate, LoanOriginalAmount and ProsperRating ralate to each other?

In this nice plot, we see that a [0, 0.25] BorrowerRate interval is reserved for borrower with the best ProsperRating (A , AA, B and C). Loan amount greater than 25000$ is allowed only to borrowers with A, AA, B prosper rating, where we see only 3 colors representing these prosper ratings.

Q:

In this plot, I tried to explore 3 variables as in previous plot, but this time I replaced LoanOriginationAmount by the Term. This plot says: HR prosper rating borrwers (brown color) cannot have 5 years term. 3 years term is more dense with largest BorrowerRate range. [0, 0.25] borrowerRate interval is reserved for borrowers with best prosper rating.

Q:

Another nice plot :). In addition to what we said on the previous plot, we add that borrowers with AA, A and B prosper rating are only the borrowers that theirs amount loans could exceed 25000$. This plot gathers all what we said ine the 2 previous plot.

Q:

Here, I just replaced BorrowerRate variable by EstimatedLoss. The two variables are strongly related. In this plot, the boundaries between categories and intervals are very clear.

Q: In the bivariate plots section, we saw that despite the E and RH prosper rating borrowers have higher borrowerRate and Estimatedloss than those in D prosper rating borrowrs, the latter are the most numerous. Why? I think the following plots answer us.

In the first graph I verify LoanOriginalAmount for the different ProsperRating. In the second graph, I ploted LOanOriginalAmount by BorrowerRate, and I colored the ProsperRating for only delinqued group. Green Color (D) is dense in this plot. As we can see, the loan amount makes the difference and therfore the MonthlyLoanPayment. Investors defund D perser rating loans almost like they defund B and C. A bad prosper rating with a challenging MonthlyLoanPayment lead to more delinqued borrowrs.

Building a logstic model to predict if a borrower is a futur delinquent or not (after the loan is funded).

To create this model, first I supposed if a loan status is in this list: (Completed, Cancelled, Current, FinalPaymentInProgress) then its borrower is delinquent. I create IsBorrowerDelinquent independent variable.

Second, explanatory variables must not be strongly correlated each other. Therefore I chose EstimatedLoss rather BorrowerRate because the latter determine better the boundaries of prosper rating levels. The other explanatory variable is the LoanOriginalAmount.

## [1] 21213.25
## 
## Call:
## glm(formula = IsBorrowerDelinquent ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8353  -0.4839  -0.3649  -0.2976   2.7290  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.225e+00  8.592e-02 -37.540  < 2e-16 ***
## EstimatedLoss       1.292e+01  5.494e-01  23.515  < 2e-16 ***
## LoanOriginalAmount -2.411e-05  5.161e-06  -4.672 2.98e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13596  on 21212  degrees of freedom
## Residual deviance: 12629  on 21210  degrees of freedom
## AIC: 12635
## 
## Number of Fisher Scoring iterations: 5
## [1] "EstimatedLoss coef 40.8399034023793"
## [1] "LoanOriginalAmount coef 0.999975890290644"
## [1] "LoanOriginalAmount coef 1.00002411029065"

Interpreting the results of our logistic regression model:

First of alla, we can see that EstimatedLoss and LoanOriginalAmount are statistically significant; thiers p-values are < 0.05. The smaller p-value the more statistically significant. Negatice sign indicates anegative correlation between predictor and independent variables. For one unit increase in Estimated Loss, delinquent borrowers are 40 times as likely, holding all other variables constant. For one unit deccrease in LoanOriginalAmount, delinquent borrowers are 1 time as likely, holding all other variables constant.

## [1] "Accuracy 0.900188560653677"

The 0.90 accuracy on the test set is quite a good result. When I added DebtToIncome variable I get 0.9060. I can’t consider it as a sognificant improvement.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

yes there are. BorrowerRate, EstimatedLoss and ProsperRating are strongly correlated variables to each other. LoanOriginalAmount and ProsperRating, and LoanOriginalAmount and Term are two by tow positively corelated.

Were there any interesting or surprising interactions between features?

Yes, the number of D prosperRating delinquent borrowers are more numerous to the one od E and HR delinquent borrowers.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a logistic linear model starting from the IsDeliquentBorrower, EstimatedLoss and LoanOriginalAmount variables. I tries to add other variables to improve accuracy value but in vain.

Final Plots and Summary

Plot One

Description One

I chose this plot to describe and sums up this busness. BorrowerRate doesn’t vary according to the loan term category. Exactly the opposite happens in traditional bank, where interst rate increases as long as the pay-off term increases. The magority of Borrowers are employed with a debt to income ratio in an interval of ( 0,1]. Borrowers ask more for a loan in 1st and 4th quarter of year.

Plot Two

Description Two

These two plots sums up all what can be said about LoanOriginalAmout, Term, ProsperRating and BorrowerRate. In one side, they show a strong correlation between BorrowerRate, EstimatedLoss and ProsperRating to each other, and a positive correlation between LoanOriginalAmout and Tem/prosperRating in other side.

Plot Three

Description Three

In this plot, I tried to find why D prosper rating delinquet borrowers are more numerous compared to E and HR prosper rating delinquet borrowers, despite E and HR prosper rating delinquet borrowers have higer BorroweRate and EstimatedLoss. I concluded that D prosper rating is a confused prosper rating for investors (lenders). It come between the a zone of good prosper ratings and the worst. Lenders fund better this prosper rating level compared to E and HR prosper rating levels and the highly relatively monthly loan payement become a challenge for borrowers with this level of risk score.


Reflection

Peer to peer lending! This is the first time I hear about this kind of system. Along this journey, I learned new things and practiced what I saw in the previous lessons. At first, the large number of prosper data variables made me confused. But as I went step by step, I started to get used to it. At first I just tried to use 10 to 15 variables, but at the end of this data analysis I found myself looking beyond the 81 variables to build my model :) I started by understanding the provided description of variables. Then I asked myself some questions and I tried to find answers by exploring data with statistical summaries and graphics. There was a clear trend between ProsperRating and BorrowerRate/EstimatedLoss in one side and between delinquent borrowers and EstimatedLoss/BorrowerRate/LoanOriginalAmount in other side. For the logistic linear model, I included LoanOriginalAmount and EstimatedLoss to predict delinquency in borrowers. The model is able to predict delinquent borrowers with an accuracy of 0.90. It is a good value but is not enough. To improve this model, I would to know more about borrower behaviour by asking more question at the time he create his listing.

Reference: